Introduction

The dataset used in this report is part of the data from the Kaggle competition ‘Data Science for Good: Center for Policing Equity’. The goal of that project is to bridge the gap caused by communication problems, suffering and generational mistrust, and to build a path toward public safety, community trust, and racial equity (Data Science for Good: Center for Policing Equity, 2022).
The main objective of this study is to analyze and visualize the 2016 policing dataset from Dallas, Texas, in order to examine whether inequity in policing exists. The report consists of 3 main parts:
I) Data preparation
To reduce the complexity of the study, only a subset of the variables is selected for analysis, and missing values are cleaned.
II) Data analysis and visualization
- Univariate analysis is used to understand what each variable shows and to find out whether any single variable reveals inequity in policing.
- The relationships between variables are then studied to identify the factors that may be associated with the inequity, if it exists.
III) Conclusion
Finally, the overall findings of the study are briefly discussed and summarized in this section.

All analyses and statistical studies in this report are conducted in R with RMarkdown, which is also used to generate the report in HTML format.

I) Data preparation

The original dataset contains 47 variables, not all of which are used in this study; 35 variables are removed in this section. Missing values in the remaining 12 variables are then cleaned. Finally, the data structure is checked and the data types of some variables are converted to the appropriate types.
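The preparation steps described above could look roughly like the following sketch in base R. It is an illustration only, not the code used for this report: the toy data frame, the date format, and every column name other than OFFICER_YEARS_ON_FORCE are assumptions.

```r
# Toy stand-in for the raw data. All values are made up, and every column
# name except OFFICER_YEARS_ON_FORCE is an assumption for illustration.
raw <- data.frame(
  INCIDENT_DATE          = c("9/3/16", "3/22/16", "", "5/22/16"),
  SUBJECT_RACE           = c("Black", "Hispanic", "White", NA),
  SUBJECT_GENDER         = c("Male", "Female", "Male", "Male"),
  OFFICER_YEARS_ON_FORCE = c("2", "7", "11", "4"),
  UNUSED_COLUMN          = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# 1) Keep only the variables needed for the analysis
keep <- c("INCIDENT_DATE", "SUBJECT_RACE", "SUBJECT_GENDER",
          "OFFICER_YEARS_ON_FORCE")
dat <- raw[, keep]

# 2) Treat empty strings as missing, then drop incomplete rows
dat[!is.na(dat) & dat == ""] <- NA
dat <- dat[complete.cases(dat), ]

# 3) Convert each variable to a suitable type (date format is assumed)
dat$INCIDENT_DATE          <- as.Date(dat$INCIDENT_DATE, format = "%m/%d/%y")
dat$SUBJECT_RACE           <- factor(dat$SUBJECT_RACE)
dat$SUBJECT_GENDER         <- factor(dat$SUBJECT_GENDER)
dat$OFFICER_YEARS_ON_FORCE <- as.integer(dat$OFFICER_YEARS_ON_FORCE)

# 4) Confirm the resulting structure
str(dat)
```

Dropping incomplete rows is the simplest way to clean missing values; whether it is appropriate depends on how many rows are lost.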

II) Data analysis and visualization

Before starting the analysis, the incident information is displayed to understand the overall situation.

Figure 1: Overall incident information

Figure 1 shows that the incidents are distributed all over Dallas. The map can be used to identify where force was used most frequently and the reason for each incident. Here, we can see that incidents occurred most often in the central area.
Next, we plot the timeline of the incidents to explore whether there is any meaningful relationship with time.

Figure 2 shows that the average number of cases at the beginning of the year is slightly higher than in the middle of the year. However, this alone does not establish a meaningful relationship between the time of year and the number of incidents.
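A timeline of this kind is typically built from per-month incident counts. The sketch below shows one way to compute them in base R; the dates are invented for illustration and are not taken from the dataset.

```r
# Made-up incident dates, for illustration only
dates <- as.Date(c("2016-01-05", "2016-01-20", "2016-02-11",
                   "2016-06-30", "2016-07-04"))

# Count incidents per calendar month; fixing the factor levels keeps
# months with zero incidents in the result
month_levels <- sprintf("%02d", 1:12)
monthly <- table(factor(format(dates, "%m"), levels = month_levels))
print(monthly)
```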

We now begin the first analysis. Univariate analysis is applied to the subject variables (subject race and subject gender) to explore whether there is notable inequality in policing between races and between genders.

Figure 3 clearly shows that force was used most often on Blacks (1294 cases), more than twice as often as on the second-ranked group, Hispanics (511 cases), while Whites rank third with 460 cases (almost 3 times fewer than Blacks).
These absolute counts alone cannot establish inequity in policing. Doing so would require the total population of each race, so that the ratio of use-of-force cases to population could be computed. Because the total number of citizens is not provided in this study, we assume that the absolute counts indicate a racial disparity.
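The ratios quoted above, and the per-capita rate that would be needed for a stronger claim, can be sketched as follows. The case counts come from Figure 3; the helper rate_per_1000 and the population figures are hypothetical placeholders, since real population totals are not part of the dataset.

```r
# Case counts reported in Figure 3
cases <- c(Black = 1294, Hispanic = 511, White = 460)

# How many times more often force was used on Blacks than on each other group
ratios <- cases["Black"] / cases[c("Hispanic", "White")]
print(round(ratios, 2))  # about 2.53 and 2.81

# Hypothetical helper: use-of-force cases per 1000 residents
rate_per_1000 <- function(cases, population) 1000 * cases / population

# PLACEHOLDER populations, NOT real figures; replace with census data
population <- c(Black = 100000, Hispanic = 100000, White = 100000)
rates <- rate_per_1000(cases, population)
print(rates)
```

With real population totals, comparing the rates rather than the raw counts would show whether the disparity persists after accounting for group size.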

Next, we examine subject gender to determine whether there is inequality between genders.

The pie chart illustrates that force was used on males almost 4.5 times as often as on females. This suggests that there might also be a gender disparity.

Next, we turn to the officers’ information to check whether any aspect of it points to inequality.

Figure 5 demonstrates that the largest share of officers who used force on subjects are White, followed by Hispanic and Black officers respectively. This might indicate a racial disparity, or it might simply reflect that more Whites work as police officers than people of other races. More data are needed to investigate this in detail and reach a conclusion.

Figure 6 shows that officers with 1-10 years of experience account for the most uses of force. The largest single share belongs to officers with 2 years of experience.
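Grouping officers into experience bands such as 1-10 years can be done with cut(). The sketch below uses invented values of OFFICER_YEARS_ON_FORCE, and the band boundaries beyond 1-10 are assumptions chosen for illustration.

```r
# Made-up years of service, one value per use-of-force record
years <- c(2, 2, 5, 11, 2, 8, 23, 1)

# Bin into experience bands; intervals are (0,10], (10,20], (20,Inf],
# so a value of 0 would fall outside the first band and become NA
band <- cut(years,
            breaks = c(0, 10, 20, Inf),
            labels = c("1-10", "11-20", "21+"))
band_counts <- table(band)
print(band_counts)

# Single most frequent value of years of service
most_common <- names(which.max(table(years)))
print(most_common)
```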

Next, we use boxplots to understand more about officers’ information.

Comparing the years on the force of White officers (the largest group) and Black officers, we find that the average number of years served is higher for Black officers than for White officers, and the distribution is also wider (the range of working experience is larger). In addition, male officers have served longer than female officers on average.
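The statistics behind such a boxplot comparison can be obtained without drawing anything by calling boxplot() with plot = FALSE. The values below are made up and the column name OFFICER_RACE is an assumption; only the mechanics are illustrated.

```r
# Made-up officer records; OFFICER_RACE is an assumed column name
officers <- data.frame(
  OFFICER_RACE           = c("White", "White", "White", "White",
                             "Black", "Black", "Black", "Black"),
  OFFICER_YEARS_ON_FORCE = c(1, 2, 3, 6, 2, 8, 12, 20)
)

# Five-number summary per group (rows: lower whisker, lower hinge,
# median, upper hinge, upper whisker); nothing is drawn
bp <- boxplot(OFFICER_YEARS_ON_FORCE ~ OFFICER_RACE,
              data = officers, plot = FALSE)
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)

# Group means, since the text compares averages
means <- tapply(officers$OFFICER_YEARS_ON_FORCE, officers$OFFICER_RACE, mean)
print(means)
```

Dropping plot = FALSE draws the boxplots themselves.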

From the previous section, we assume that there is a racial and gender disparity in policing. In this section, bivariate analysis is applied to study the relationships between variables and answer the following questions.
- Is injury from the use of force related to the subject’s race?
- Is the officer’s race correlated with the subject’s race?
- Is the subject’s injury related to their race and gender?
- Does the officer’s level of experience (OFFICER_YEARS_ON_FORCE) have an influence on the subject’s injury (by race)?

Injury by subject’s race: we now investigate in more detail how the force used on subjects of each race resulted in injury.

Figure 8 shows that the number of injuries tracks the number of use-of-force cases. Since force is used most often on Blacks, most injuries also occur among Blacks, which is expected.

Subject’s race by officer’s race: here we study whether the officer’s race influences the use of force on subjects of each race.

The figure shows that more than half of the force used on Blacks, Hispanics and Whites came from White officers. This might be a sign of racial disparity, or it might simply be because White officers make up a larger share of the police than officers of other races.
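The share of force attributable to officers of each race, within each subject race, is a row-wise proportion of a cross-tabulation. A minimal sketch with invented records follows; the column names SUBJECT_RACE and OFFICER_RACE are assumptions.

```r
# Made-up incident records, for illustration only
incidents <- data.frame(
  SUBJECT_RACE = c("Black", "Black", "Black", "Hispanic",
                   "Hispanic", "White", "White", "White"),
  OFFICER_RACE = c("White", "White", "Black", "White",
                   "Hispanic", "White", "White", "Black")
)

# Rows: subject race, columns: officer race
tab <- table(incidents$SUBJECT_RACE, incidents$OFFICER_RACE)

# Row proportions: for each subject race, the share of incidents
# involving officers of each race
share <- prop.table(tab, margin = 1)
print(round(share, 2))
```

To separate the two explanations given above, these shares would need to be compared with the racial composition of the whole police force, which this dataset does not provide.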

Subject injury by race and gender: in this section, we investigate injury further. For each subject race and gender, we examine whether there is a notable relationship with injury.

Figure 10 displays a similar proportion of injuries for each subject race and each gender. We can therefore assume that the subject’s race and gender do not influence the injury rate; the number of injuries depends only on the number of cases.
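Comparing injury proportions rather than injury counts amounts to computing the mean of an injury indicator within each race and gender group. The sketch below assumes a SUBJECT_INJURY column coded "Yes"/"No"; the column names, the coding, and all values are invented for illustration.

```r
# Made-up records; column names and "Yes"/"No" coding are assumptions
incidents <- data.frame(
  SUBJECT_RACE   = c("Black", "Black", "Black", "Black",
                     "White", "White", "Hispanic", "Hispanic"),
  SUBJECT_GENDER = c("Male", "Male", "Female", "Female",
                     "Male", "Male", "Male", "Male"),
  SUBJECT_INJURY = c("Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes")
)

# Logical indicator: TRUE when the subject was injured
incidents$injured <- incidents$SUBJECT_INJURY == "Yes"

# Mean of the indicator = proportion injured in each race/gender group
rates <- aggregate(injured ~ SUBJECT_RACE + SUBJECT_GENDER,
                   data = incidents, FUN = mean)
print(rates)
```

If the proportions are close across groups, as in this toy data, the number of injuries is driven by the number of cases rather than by race or gender.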

Subject’s race by officer’s experience (OFFICER_YEARS_ON_FORCE): lastly, we check whether officers’ experience correlates with the force used on subjects of each race.

Consistent with the previous charts, force is used most on Blacks, who are also the largest group to sustain injuries, and most cases involve officers with 1-10 years of experience.

III) Conclusion

From the analysis of the 2016 policing dataset from Dallas, Texas, we conclude that there are racial and gender disparities in policing. Force is used most often on Blacks, followed by Hispanics and Whites. The officer’s race is one suspected factor related to the racial bias. The officers’ years of service also show some correspondence with the disparity; this result could be used to design training courses for officers at each level of experience. This is only a preliminary investigation, and more data are required to make the results more precise and reliable.